SEQ^3: Differentiable Sequence-to-Sequence-to-Sequence Autoencoder for Unsupervised Abstractive Sentence Compression
Neural sequence-to-sequence models are currently the dominant approach in
several natural language processing tasks, but require large parallel corpora.
We present a sequence-to-sequence-to-sequence autoencoder (SEQ^3), consisting
of two chained encoder-decoder pairs, with words used as a sequence of discrete
latent variables. We apply the proposed model to unsupervised abstractive
sentence compression, where the first and last sequences are the input and
reconstructed sentences, respectively, while the middle sequence is the
compressed sentence. Constraining the length of the latent word sequences
forces the model to distill important information from the input. A pretrained
language model, acting as a prior over the latent sequences, encourages the
compressed sentences to be human-readable. Continuous relaxations enable us to
sample from categorical distributions, allowing gradient-based optimization,
unlike alternatives that rely on reinforcement learning. The proposed model
does not require parallel text-summary pairs, achieving promising results in
unsupervised sentence compression on benchmark datasets.
Comment: Accepted to NAACL 2019
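For readers curious how the continuous relaxation makes the chained autoencoder trainable end to end, here is a minimal PyTorch sketch, not the authors' implementation: the compressor emits soft word distributions via the Gumbel-softmax relaxation, the reconstructor reads their expected embeddings, and the reconstruction loss backpropagates through both. The vocabulary size, dimensions, temperature, and the non-autoregressive decoder are illustrative assumptions.

# Minimal sketch of the SEQ^3 idea (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, EMB, HID, TAU = 10_000, 128, 256, 0.5  # assumed hyperparameters

class Seq2Seq(nn.Module):
    """Bare-bones GRU encoder-decoder that emits vocabulary logits."""
    def __init__(self):
        super().__init__()
        self.enc = nn.GRU(EMB, HID, batch_first=True)
        self.dec = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, src_emb, tgt_len):
        _, h = self.enc(src_emb)                              # encode source embeddings
        dec_in = torch.zeros(src_emb.size(0), tgt_len, EMB)   # dummy inputs (non-autoregressive sketch)
        dec_out, _ = self.dec(dec_in, h)
        return self.out(dec_out)                              # (batch, tgt_len, VOCAB)

embed = nn.Embedding(VOCAB, EMB)
compressor, reconstructor = Seq2Seq(), Seq2Seq()

def seq3_loss(input_ids, compressed_len):
    src_emb = embed(input_ids)
    # Compressor: logits over the words of the latent (compressed) sentence.
    comp_logits = compressor(src_emb, compressed_len)
    # Continuous relaxation: differentiable "soft words" instead of hard samples.
    soft_words = F.gumbel_softmax(comp_logits, tau=TAU, hard=False)
    # Embed each soft word as an expectation over the embedding matrix.
    comp_emb = soft_words @ embed.weight                      # (batch, compressed_len, EMB)
    # Reconstructor: rebuild the original sentence from the compressed one.
    rec_logits = reconstructor(comp_emb, input_ids.size(1))
    return F.cross_entropy(rec_logits.reshape(-1, VOCAB), input_ids.reshape(-1))

loss = seq3_loss(torch.randint(0, VOCAB, (4, 20)), compressed_len=8)
loss.backward()   # gradients reach the compressor through the relaxation

Constraining compressed_len to be much shorter than the input is what forces the middle sequence to act as a summary; the LM prior and length penalties from the abstract would be added to this reconstruction loss.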
Language Model Prior for Low-Resource Neural Machine Translation
The scarcity of large parallel corpora is an important obstacle for neural
machine translation. A common solution is to exploit the knowledge of language
models (LM) trained on abundant monolingual data. In this work, we propose a
novel approach to incorporate a LM as prior in a neural translation model (TM).
Specifically, we add a regularization term, which pushes the output
distributions of the TM to be probable under the LM prior, while avoiding wrong
predictions when the TM "disagrees" with the LM. This objective relates to
knowledge distillation, where the LM can be viewed as teaching the TM about the
target language. The proposed approach does not compromise decoding speed,
because the LM is used only at training time, unlike previous work that
requires it during inference. We present an analysis of the effects that
different methods have on the distributions of the TM. Results on two
low-resource machine translation datasets show clear improvements even with
limited monolingual data.
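A minimal sketch of the training objective described above, assuming (not taken from the paper) a temperature-smoothed KL term between the TM's output distributions and those of a frozen LM on the same target prefixes; tm_logits, lm_logits, beta, and tau are illustrative names.

# Sketch of a cross-entropy loss regularized by a frozen LM prior.
import torch
import torch.nn.functional as F

def lm_prior_loss(tm_logits, lm_logits, tgt_ids, pad_id, beta=0.5, tau=2.0):
    # Standard translation cross-entropy against the reference tokens.
    ce = F.cross_entropy(tm_logits.transpose(1, 2), tgt_ids, ignore_index=pad_id)
    # Distillation-style term: push the TM's distributions to stay probable
    # under the LM prior (LM is detached, i.e. frozen).
    kl = F.kl_div(
        F.log_softmax(tm_logits / tau, dim=-1),
        F.softmax(lm_logits.detach() / tau, dim=-1),
        reduction="batchmean",
    )
    # The LM is needed only here, at training time; decoding uses the TM alone.
    return ce + beta * kl

# Example with random tensors standing in for real model outputs.
B, T, V = 2, 7, 1000
loss = lm_prior_loss(torch.randn(B, T, V, requires_grad=True),
                     torch.randn(B, T, V),
                     torch.randint(1, V, (B, T)), pad_id=0)
loss.backward()

Because the regularizer only appears in the loss, the trained TM is deployed by itself, which is why decoding speed is unaffected.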
When Does Monolingual Data Help Multilingual Translation: The Role of Domain and Model Scale
Multilingual machine translation (MMT), trained on a mixture of parallel and
monolingual data, is key for improving translation in low-resource language
pairs. However, the literature offers conflicting results on the performance of
different methods of including monolingual data. To resolve this, we examine
how denoising autoencoding (DAE) and backtranslation (BT) impact MMT under
different data conditions and model scales. Unlike prior studies, we use a
realistic dataset of 100 translation directions and consider many domain
combinations of monolingual and test data. We find that monolingual data
generally helps MMT, but models are surprisingly brittle to domain mismatches,
especially at smaller model scales. BT is beneficial when the parallel,
monolingual, and test data sources are similar but can be detrimental
otherwise, while DAE is less effective than previously reported. Next, we
analyze the impact of scale (from 90M to 1.6B parameters) and find it is
important for both methods, particularly DAE. As scale increases, DAE
transitions from underperforming the parallel-only baseline at 90M to
converging with BT performance at 1.6B, and even surpassing it in low-resource settings.
These results offer new insights into how best to use monolingual data in MMT.
Comment: Work in progress
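As a rough illustration of the two ways monolingual target-side text enters training, the sketch below builds synthetic pairs for BT (synthetic source produced by a reverse model) and DAE (a noised copy of the target used as the source); reverse_translate and the noise parameters are hypothetical stand-ins, not the paper's pipeline.

# Sketch of BT vs. DAE synthetic-pair construction from monolingual text.
import random

def dae_noise(tokens, drop_p=0.1, max_shuffle=3):
    # Drop words at random, then lightly shuffle what remains.
    kept = [t for t in tokens if random.random() > drop_p] or list(tokens)
    order = sorted(range(len(kept)), key=lambda i: i + random.uniform(0, max_shuffle))
    return [kept[i] for i in order]

def make_synthetic_pairs(mono_target_sents, reverse_translate):
    bt_pairs, dae_pairs = [], []
    for sent in mono_target_sents:
        # BT: synthetic source from a reverse (target-to-source) model.
        bt_pairs.append((reverse_translate(sent), sent))
        # DAE: the "source" is a corrupted copy of the target itself.
        dae_pairs.append((" ".join(dae_noise(sent.split())), sent))
    return bt_pairs, dae_pairs

# Toy usage with an identity function standing in for the reverse model.
bt, dae = make_synthetic_pairs(["a small example sentence"], reverse_translate=lambda s: s)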